
Optimized Deep Neural Networks for Real-Time Object Classification on Embedded GPUs



Abstract

Convolution is the most computationally intensive task in a Convolutional Neural Network (CNN), demanding substantial memory and computational power. Several approaches exist to compute convolution and reduce its computational complexity. In this paper, a matrix multiplication-based convolution (ConvMM) approach is fully parallelized using the concurrent resources of a GPU (Graphics Processing Unit) and optimized, considerably improving the performance of image classifiers and making them applicable to real-time embedded applications. The flow of this CUDA (Compute Unified Device Architecture)-based scheme is optimized using unified memory and hardware-dependent acceleration of matrix multiplication. The proposed flow is evaluated on two different embedded platforms: first on an Nvidia Jetson TX1 embedded board and then on the Tegra K1 GPU of an Nvidia Shield K1 Tablet. The performance of this optimized and accelerated convolutional layer is compared with its sequential and heterogeneous versions. Results show that the proposed scheme significantly improves overall results, including energy efficiency, storage requirements, and inference performance. In particular, the proposed scheme on embedded GPUs is hundreds of times faster than the sequential version and delivers tens of times higher performance than the heterogeneous approach.
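The core idea behind matrix multiplication-based convolution is to unroll the input's sliding windows into columns (often called im2col) so that the entire convolution reduces to a single GEMM, which GPUs and BLAS libraries execute very efficiently. As a minimal illustration of that transformation (not the paper's CUDA implementation; the function names and the single-channel, stride-1, no-padding setup are simplifying assumptions), a NumPy sketch:

```python
import numpy as np

def im2col(x, kh, kw):
    # Unroll every kh x kw sliding window of a single-channel
    # image into one column of the output matrix (stride 1, no padding).
    H, W = x.shape
    out_h, out_w = H - kh + 1, W - kw + 1
    cols = np.empty((kh * kw, out_h * out_w))
    idx = 0
    for i in range(out_h):
        for j in range(out_w):
            cols[:, idx] = x[i:i + kh, j:j + kw].ravel()
            idx += 1
    return cols

def conv_mm(x, kernel):
    # Convolution expressed as one matrix multiplication:
    # the flattened kernel times the im2col matrix replaces
    # the usual nested convolution loops.
    kh, kw = kernel.shape
    cols = im2col(x, kh, kw)
    out = kernel.ravel() @ cols
    out_h, out_w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    return out.reshape(out_h, out_w)
```

In a full CNN layer the same trick generalizes: all filters are stacked as rows of the left-hand matrix, so one GEMM produces every output channel at once, which is what makes the approach a good fit for GPU acceleration.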
